To follow along, use the lab07 code and make sure the following packages are installed:

You can use the following code to perform the installation:

install.packages(c("dplyr","ggplot2","tidyr"))

Now load the packages

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("ggplot2")
library("tidyr")

Overview

Recall that a regression model has the following 4 main assumptions:

Throughout this lab, we will investiage these assumptions by simulating data that violates the assumption and determining how the diagnostics (including plots) express this violation.

Simulation

Throughout this lab, we will be

  1. simulating data,
  2. running a regression with those data, and
  3. looking at the diagnostics.

Simulate data

For fully reproducible results, we will set the seed.

set.seed(20170320)
n <- 100 
x <- rnorm(n)
b0 <- 1
b1 <- 2
sigma <- 0.4
error <- rnorm(n,0,sigma)
y <- b0+b1*x+error

d <- data.frame(x=x,y=y)

We can visualize the data

plot(y ~ x, data = d)

or using ggplot

library("ggplot2")
ggplot(d, aes(x=x, y=y)) +
  geom_point() +
  theme_bw()

Run the regression

m <- lm(y ~ x, data = d)
m
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Coefficients:
## (Intercept)            x  
##       1.037        2.011
summary(m)
## 
## Call:
## lm(formula = y ~ x, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98181 -0.28723  0.01533  0.28822  0.88930 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.03697    0.03714   27.92   <2e-16 ***
## x            2.01063    0.03706   54.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.371 on 98 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9674 
## F-statistic:  2943 on 1 and 98 DF,  p-value: < 2.2e-16

Plot the line

plot(y ~ x, data = d)
abline(m, col='blue')

or using ggplot

ggplot(d, aes(x=x, y=y)) +
  geom_point() + 
  stat_smooth(method = "lm", se = FALSE) +
  theme_bw()

Look at diagnostics

opar = par(mfrow=c(2,3))  # Create a 2x3 grid of plots and save the original settings
plot(m, 1:6, ask=FALSE)   # Plot all 6 diagnostics plots

par(opar)                 # Return to original graphics settings

These are what plots look like when all model assumptions are satisfied.

Activity

Simulate data the following data

  • There are 234 total observations.
  • Explanatory variable is a random uniform on (0,1).
  • Intercept is -0.5, slope is 25, variance is 9.

Then run the regression diagnostic plots.

Normality

There are many ways for the errors to not have a normal distribution. We will consider errors that are heavy-tailed, right-skewed, and left-skewed.

n <- 100
errors <- data.frame(rep = 1:n,
                     normal       = rnorm(n),
                     heavy_tailed = rt(n, df=3),
                     right_skewed = exp(rnorm(n))) %>%
  
  mutate(right_skewed = right_skewed - exp(1/2),  # make sure errors have expectation of 0
         left_skewed  = 0 - right_skewed) 

errors_long <- errors %>%
  tidyr::gather(distribution,value,-rep) %>%
  mutate(distribution = factor(distribution, 
                               levels = c("normal",
                                          "heavy_tailed",
                                          "right_skewed",
                                          "left_skewed")))

Before considering the regression, let’s take a look at these errors

ggplot(errors_long, aes(x=value)) + 
  geom_histogram() + 
  facet_grid(. ~ distribution) + 
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(errors_long, aes(sample=value)) + 
  stat_qq() + 
  facet_grid(. ~ distribution) + 
  theme_bw()

The stat_qq() function does not have an option to include the reference line, but this line can be added using this code.

In base R, we can use the qqnorm() and qqline() functions to manually create the qqplot.

opar = par(mfrow=c(1,4))
qqnorm(errors$normal, main="Normal")
qqline(errors$normal)
qqnorm(errors$heavy_tailed, main="Heavy-tailed")
qqline(errors$heavy_tailed)
qqnorm(errors$right_skewed, main="Right-skewed")
qqline(errors$right_skewed)
qqnorm(errors$left_skewed, main="Left-skewed")
qqline(errors$left_skewed)

par(opar)

Regression

Of course, we don’t see these errors directly. Instead, we obtain the residuals which are estimates of the errors.

For simplicity, we will assume the true regression model has intercept 0 and slope 0. Thus the response is really just the errors themselves.

y <- errors %>%
  mutate(x = rnorm(n))

models <- list()
models[[1]] <- lm(normal       ~ x, data=y)
models[[2]] <- lm(heavy_tailed ~ x, data=y)
models[[3]] <- lm(right_skewed ~ x, data=y)
models[[4]] <- lm(left_skewed  ~ x, data=y)
names(models) <- c("normal","heavy_tailed","right_skewed","left_skewed")

Normal

opar <- par(mfrow=c(2,3))
plot(models$normal, 1:6, ask=FALSE)

par(opar)

Heavy_tailed

opar <- par(mfrow=c(2,3))
plot(models$heavy_tailed, 1:6, ask=FALSE)

par(opar)

Right_skewed

opar <- par(mfrow=c(2,3))
plot(models$right_skewed, 1:6, ask=FALSE)

par(opar)

Left_skewed

opar <- par(mfrow=c(2,3))
plot(models$left_skewed, 1:6, ask=FALSE)

par(opar)

QQ-plots

Focusing on the qqplots

opar = par(mfrow=c(1,4))
for (i in 1:4) plot(models[[i]], 2, main=names(models)[i])

par(opar)

Activity

Using the same errors, assume an intercept of 10 and a slope of -5. Run the regression analysis and view the QQ-plots.

Constant variance

In this section, we will assume the errors are normally distributed but look at what happens when the variance is not constant. The most common scenario is for the variance to increase as the expected response increases. If the slope is positive (negative), this means as the explanatory variable increases (decreases) the error variance increases.

Assume positive slope

n  <- 100
b0 <- 0
b1 <- 1
d  <- data.frame(x = runif(n)) %>%
  mutate(error = rnorm(n,0,x),
         y = b0+b1*x+error)
  
m <- lm(y~x, data=d)
opar <- par(mfrow=c(2,3))
plot(m, 1:6, ask=FALSE)

par(opar)

Residuals vs fitted values

A (side-ways) funnel pattern is observed in the residuals vs fitted values plot.

plot(m, 1)

Absolute residuals vs fitted values

A half funnel pattern is observed in the absolute residuals vs fitted values plot and the (red) smoother line indicates an increasing pattern.

plot(m,3)

Activity

Simulate data that has a negative slope and an error variance that increases with the fitted value.

Independence

There are many ways to violate the assume of independence of the errors in linear regression. The most common way is for there to be serial correlation, i.e. that the errors are related due to the time the samples were obtained. If you know when the samples were obtained, you should plot the residuals vs that time. Often the dataset does not have this information, but typically the observations are recorded sequentially. So, we can use the row number as a proxy for the observation time. Thus at a minimum, I suggest you plot residuals vs row number.

Typically, you will have to create the row number (or index) variable.

Independent errors

n <- 100
b0 <-1 
b1 <- 2
d <- data.frame(x = rnorm(n)) %>%
  mutate(
    error =rnorm(n),
    y = b0+b1*x + error)

m <- lm(y ~ x, data = d)

The m object has several components take a look using the names() function.

names(m)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

The residuals component provides the fitted values in the same order as they were in the data set, so plot these.

plot(m$residuals)

Alternatively with ggplot and an easy smoothing line

dd <- data.frame(index = 1:length(m$residuals),
                 residuals = m$residuals)
ggplot(dd, aes(index, residuals)) +
  geom_point() + 
  stat_smooth(se = FALSE) +
  theme_bw()
## `geom_smooth()` using method = 'loess'

Serially correlated errors

d <- d %>% 
  mutate(error = sin(2*pi*(1:n)/n)+rnorm(n),
         y = b0+b1*x+error)

m <- lm(y ~ x, data = d)

dd <- data.frame(index = 1:length(m$residuals),
                 residuals = m$residuals)
ggplot(dd, aes(index, residuals)) +
  geom_point() + 
  stat_smooth(se = FALSE) +
  theme_bw()
## `geom_smooth()` using method = 'loess'

It is certainly not obvious that the errors have a sin wave structure, but it is pretty clear something is wrong since the residuals for early observations are positive and for later observations are negative.

There is nothing obvious about the default diagnostic plots.

opar <- par(mfrow=c(2,3))
plot(m, 1:6, ask=FALSE)

par(opar)

Activity

Simulate errors that have a positive trend, fit the regression, and plot the residuals vs index.

Linearity

The linearity assumption assumes a linear relationship between the expected response and the explanatory variable. Thus, the linearity assumption is best looked at by plotting the residuals vs the explanatory variable and looking for a pattern.

Linear

If the relationship is truly linear, this plot will not show any patterns.

n <- 100
b0 <- 1
b1 <- 2
d <- data.frame(x = rnorm(n)) %>%
  mutate(y = b0+b1*x + rnorm(n))
                
m <- lm(y ~ x, data = d)

d$residuals <- m$residuals

ggplot(d, aes(x, residuals)) + 
  geom_point() + 
  stat_smooth(se = FALSE) + 
  theme_bw()
## `geom_smooth()` using method = 'loess'

Non-linear relationship

n <- 100
b0 <- 1
b1 <- 2
d <- data.frame(x = rnorm(n)) %>%
  mutate(y = b0+b1*x+x^2 + rnorm(n))
                
m <- lm(y ~ x, data = d)

d$residuals <- m$residuals

ggplot(d, aes(x, residuals)) + 
  geom_point() + 
  stat_smooth(se = FALSE) + 
  theme_bw()
## `geom_smooth()` using method = 'loess'

This will also show up in the two residuals vs fitted values plots.

opar <- par(mfrow=c(2,3))
plot(m, 1:6, ask=FALSE)

par(opar)

But when the models get more complex, this relationship might not be clear from these plots. Therefore, I suggest you plot residuals versus each explanatory variable.

Activity

Simulate explanatory variables from a normal distribution with mean 1 and standard deviation 0.1. Then simulate data with an intercept of 1, slope of 2, and error standard deviation of 1. Plot residuals vs the explanatory variable. What pattern do you see?

Summary

In this lab, we considered the 4 main regression assumptions how violation of these assumptions get expressed in regression diagnostics. Of course, there is no reason that only a single assumption is violated at a time. Thus, you may want to simulate data with a variety of violates and see how those combinations of violations express themselves in the diagnostics. In reality, all the assumptions are violated all the time and we are trying to identify (via the diagnostics) when are the assumptions violated so badly that the model is ineffective for its intended purpose.